Abstract

Product Distribution (PD) theory was recently developed as a broad
framework for analyzing and optimizing distributed systems. Here we
demonstrate its use for adaptive distributed control of Multi-Agent
Systems (MAS’s), i.e., for distributed stochastic optimization using
MAS’s. First we review one motivation of PD theory, as the
information-theoretic extension of conventional full-rationality game
theory to the case of bounded rational agents. In this extension the
equilibrium of the game is the optimizer of a Lagrangian of the
(probability distribution of) the joint state of the agents. When the
game in question is a team game with constraints, that equilibrium
optimizes the expected value of the team game utility, subject to
those constraints. One common way to find that equilibrium is to have
each agent run a Reinforcement Learning (RL) algorithm. PD theory
reveals this to be a particular type of search algorithm for
minimizing the Lagrangian. Typically that algorithm is quite
inefficient. A more principled alternative is to use a variant of
Newton’s method to minimize the Lagrangian. Here we compare this
alternative to RL-based search in three sets of computer experiments.
These are the N-Queens problem and bin-packing problem from the
optimization literature, and the Bar problem from the distributed RL
literature. Our results confirm that the PD-theory-based approach
outperforms the RL-based scheme in all three domains.


1. Introduction 

Product Distribution (PD) theory is a recently intro- 
duced broad framework for analyzing, controlling, and op- 
timizing distributed systems [16, 17, 18]. Among its po- 
tential applications are adaptive, distributed control of a 
Multi-Agent System (MAS), (constrained) optimization, 
sampling of high-dimensional probability densities (i.e., 
improvements to Metropolis sampling), density estimation, 
numerical integration, reinforcement learning, information- 
theoretic bounded rational game theory, population biology, 
and management theory. Some of these are investigated in
[2, 1, 13, 11].

Here we investigate PD theory’s use for distributed stochastic
optimization using a MAS (which for our purposes is the same as
adaptive, distributed control of a MAS). Often in stochastic
approaches to optimization one uses probability distributions to help
search for a point $x \in X$ optimizing a function $G(x)$. In
contrast, in the PD approach one searches for a probability
distribution $P(x)$ that optimizes an associated Lagrangian,
$\mathcal{L}(P)$. Since $P$ is a vector in a Euclidean space, the
search can be done via techniques like gradient descent or Newton’s
method, even if $X$ is a categorical, finite space.

One motivation of this approach embodied in PD theory starts with the
fact that in any game the agents are independent, with each agent $i$
choosing its move $x_i$ at any instant by sampling its probability
distribution (mixed strategy) at that instant, $q_i(x_i)$.
Accordingly, the distribution of the joint-moves is a product
distribution, $P(x) = \prod_i q_i(x_i)$. In this representation of a
MAS, all coupling between the agents occurs indirectly; it is the
separate distributions of the agents $\{q_i\}$ that are statistically
coupled, while the actual moves of the agents are independent.
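The product-distribution representation can be illustrated with a minimal Python sketch. The function names and the three example distributions below are hypothetical, chosen only for illustration; they do not come from the experiments reported here.

```python
import random

def sample_joint_move(q, rng):
    """Draw one joint move: each agent i samples its own q_i independently."""
    return tuple(rng.choices(range(len(q_i)), weights=q_i, k=1)[0] for q_i in q)

def joint_probability(q, x):
    """P(x) = prod_i q_i(x_i) for a product distribution."""
    p = 1.0
    for q_i, x_i in zip(q, x):
        p *= q_i[x_i]
    return p

# Hypothetical mixed strategies for three agents (two have 2 moves, one has 3).
q = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.3, 0.5]]
rng = random.Random(0)
x = sample_joint_move(q, rng)
```

Changing one agent's distribution changes the joint distribution only through that agent's factor, which is the sense in which the moves themselves stay independent.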

This representation has been adopted implicitly before, in algorithms
that find the equilibria by having each agent run its own
Reinforcement Learning (RL) algorithm [15, 6, 10, 20, 21, 19]. In
these approaches the utility function of each agent is based on the
world utility $G(x)$ mapping the joint move of the agents,
$x \in X$, to the performance of the overall system. However the
agents in a MAS are bounded rational. Moreover the equilibrium they
reach will typically involve mixed strategies rather than pure
strategies, i.e., they don’t settle on a single point $x$ optimizing
$G(x)$. This suggests formulating an approach that explicitly
accounts for the bounded rational, mixed strategy character of the
agents.

This is done in PD theory, which uses information theory to recast
the optimization problem as one of minimizing a Lagrangian,
$\mathcal{L}(P)$, rather than settling to the equilibrium of the
game. From the perspective of PD theory, the update rules used by the
agents in RL-based systems are just one particular set of
(inefficient) ways of finding that minimizing distribution [16, 17].
More principled alternatives like variants of Newton’s method should
perform better. In addition, such alternatives allow us to leverage
well-understood techniques of convex optimization for incorporating
constraints over $X$. In contrast, RL-based schemes typically
incorporate constraints in an ad hoc fashion, via penalty functions.

Here we compare this alternative to RL-based search al- 
gorithms in three sets of computer experiments. These ex- 
periments also show how the perspective of PD theory can 
be used to incorporate constraints into RL-based search al- 
gorithms without relying on ad hoc penalty functions. 

In the next section we review the game-theory motivation 
of PD theory. We then present details of our Lagrangian- 
minimization algorithm. We end with computer experi- 
ments comparing this algorithm to some state-of-the-art 
RL-based algorithms. These experiments involve the N-Queens problem
and bin-packing problem from the optimization literature, and the Bar
problem from the distributed RL literature. Our results confirm that
the PD-theory-based approach outperforms the RL-based scheme in all
three domains.

2. Bounded Rational Game Theory 

In this section we motivate PD theory as the information- 
theoretic formulation of bounded rational game theory. 

2.1. Review of noncooperative game theory 

In noncooperative game theory one has a set of $N$ players. Each
player $i$ has its own set of allowed pure strategies. A mixed
strategy is a distribution $q_i(x_i)$ over player $i$’s possible pure
strategies. Each player $i$ also has a private utility function $g_i$
that maps the pure strategies adopted by all $N$ of the players into
the real numbers. So given mixed strategies of all the players, the
expected utility of player $i$ is
$E(g_i) = \int dx \prod_j q_j(x_j)\, g_i(x)$.¹

In a Nash equilibrium every player adopts the mixed strategy that
maximizes its expected utility, given the mixed strategies of the
other players. More formally, $\forall i$,
$q_i = \mathrm{argmax}_{q'_i} \int dx\, q'_i(x_i) \prod_{j \neq i} q_j(x_j)\, g_i(x)$.
Perhaps the major objection that has been raised to the Nash
equilibrium concept is its assumption of full rationality [8, 9, 3].
This is the assumption that every player $i$ can both calculate what
the strategies $q_{j \neq i}$ will be and then calculate its
associated optimal distribution. In other words, it is the assumption
that every player will calculate the entire joint distribution
$q(x) = \prod_j q_j(x_j)$. If for no other reasons than computational
limitations of real humans, this assumption is essentially untenable.

¹ Throughout this paper, the integral sign is implicitly interpreted
as appropriate, e.g., as Lebesgue integrals, point-sums, etc.

2.2. Review of the maximum entropy principle 

Shannon was the first person to realize that based on any of several
separate sets of very simple desiderata, there is a unique
real-valued quantification of the amount of syntactic information in
a distribution $P(y)$. He showed that this amount of information is
(the negative of) the Shannon entropy of that distribution,
$S(P) = -\int dy\, P(y) \ln[P(y)]$. So for example, the distribution
with minimal information is the one that doesn’t distinguish at all
between the various $y$, i.e., the uniform distribution. Conversely,
the most informative distribution is the one that specifies a single
possible $y$. Note that for a product distribution, entropy is
additive, i.e., $S(\prod_i q_i(y_i)) = \sum_i S(q_i)$.
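The additivity of entropy over a product distribution is easy to check numerically. The following is a small sketch for finite spaces (so the integral becomes a sum); the example distributions are hypothetical.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy -sum_y p(y) ln p(y), with 0 ln 0 taken as 0."""
    return -sum(p_y * math.log(p_y) for p_y in p if p_y > 0.0)

# Hypothetical marginals q_i for three agents.
q = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.3, 0.5]]

# Enumerate the full joint product distribution P(x) = prod_i q_i(x_i).
joint = [
    math.prod(q_i[x_i] for q_i, x_i in zip(q, x))
    for x in product(*(range(len(q_i)) for q_i in q))
]

entropy_of_joint = entropy(joint)
sum_of_marginal_entropies = sum(entropy(q_i) for q_i in q)
```

The two quantities agree up to floating-point rounding, and the uniform distribution over four outcomes attains the maximal value ln 4.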

Say we are given some incomplete prior knowledge about
a distribution $P(y)$. How should one estimate $P(y)$ based
on that prior knowledge? Shannon’s result tells us how to 
do that in the most conservative way: have your estimate of 
P(y) contain the minimal amount of extra information be- 
yond that already contained in the prior knowledge about 
P(y). Intuitively, this can be viewed as a version of Oc- 
cam’s razor. This approach is called the maximum entropy 
(maxent) principle. It has proven useful in domains rang- 
ing from signal processing to supervised learning [5, 12]. 

2.3. Maxent Lagrangians 

Much of the work on equilibrium concepts in game the- 
ory adopts the perspective of an external observer of a game. 
We are told something concerning the game, e.g., its utility 
functions, information sets, etc., and from that wish to pre- 
dict what joint strategy will be followed by real-world play- 
ers of the game. Say that in addition to such information, 
we are told the expected utilities of the players. What is our 
best estimate of the distribution q that generated those ex- 
pected utility values? By the maxent principle, it is the dis- 
tribution with maximal entropy, subject to those expectation 
values. 

To formalize this, for simplicity assume a finite number of players
and of possible strategies for each player. To agree with the
convention in other fields, from now on we implicitly flip the sign
of each $g_i$ so that the associated player $i$ wants to minimize
that function rather than maximize it. Intuitively, this flipped
$g_i(x)$ is the “cost” to player $i$ when the joint-strategy is $x$,
though we will still use the term “utility”.

Then for prior knowledge that the expected utilities of the players
are given by the set of values $\{\epsilon_i\}$, the maxent estimate
of the associated $q$ is given by the minimizer of the Lagrangian

$$\mathcal{L}(q) = \sum_i \beta_i \left[ E_q(g_i) - \epsilon_i \right] - S(q) \qquad (1)$$

$$= \sum_i \beta_i \left[ \int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i \right] - S(q) \qquad (2)$$

where the subscript on the expectation value indicates that it is
evaluated under distribution $q$, and the $\{\beta_i\}$ are “inverse
temperatures” implicitly set by the constraints on the expected
utilities.

Solving, we find that the mixed strategies minimizing the Lagrangian
are related to each other via

$$q_i(x_i) \propto e^{-E_{q_{(i)}}(G \mid x_i)} \qquad (3)$$

where the overall proportionality constant for each $i$ is set by
normalization, and $G = \sum_i \beta_i g_i$.² In Eq. 3 the
probability of player $i$ choosing pure strategy $x_i$ depends on the
effect of that choice on the utilities of the other players. This
reflects the fact that our prior knowledge concerns all the players
equally.

If we wish to focus only on the behavior of player $i$, it is
appropriate to modify our prior knowledge. To see how to do this,
first consider the case of maximal prior knowledge, in which we know
the actual joint-strategy of the players, and therefore all of their
expected costs. For this case, trivially, the maxent principle says
we should “estimate” $q$ as that joint-strategy (it being the $q$
with maximal entropy that is consistent with our prior knowledge).
The same conclusion holds if our prior knowledge also includes the
expected cost of player $i$.

Modify this maximal set of prior knowledge by removing from it
specification of player $i$’s strategy. So our prior knowledge is the
mixed strategies of all players other than $i$, together with player
$i$’s expected cost. We can incorporate prior knowledge of the other
players’ mixed strategies directly, without introducing Lagrange
parameters. The resultant maxent Lagrangian is

$$\mathcal{L}_i(q_i) \equiv \beta_i \left[ E(g_i) - \epsilon_i \right] - S_i(q_i)$$

$$= \beta_i \left[ \int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i \right] - S_i(q_i).$$

The first term in $\mathcal{L}_i$ is minimized by a perfectly
rational player. The second term is minimized by a perfectly
irrational player, i.e., by a perfectly uniform mixed strategy $q_i$.
So $\beta_i$ in the maxent Lagrangian explicitly specifies the
balance between the rational and irrational behavior of the player.
In particular, for $\beta_i \to \infty$, by minimizing the
Lagrangians we recover the Nash equilibria of the game. More
formally, in that limit the set of $q$ that simultaneously minimize
the Lagrangians is the same as the set of delta functions about the
Nash equilibria of the game. The same is true for Eq. 3.

² The subscript on the expectation value indicates that it is
evaluated according to the distribution $\prod_{j \neq i} q_j$.

These Lagrangians are minimized by a set of coupled Boltzmann
distributions:

$$q_i(x_i) \propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)}. \qquad (4)$$

Following Nash, we can use Brouwer’s fixed point theorem to establish
that for any non-negative values $\{\beta_i\}$, there must exist at
least one product distribution given by the product of these
Boltzmann distributions (one term in the product for each $i$).
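As an illustration of how the inverse temperature trades off rational against irrational behavior, here is a minimal sketch that turns a vector of conditional expected costs into a Boltzmann distribution. The function name `boltzmann` and the cost values are hypothetical; subtracting the minimum cost is only a numerical-stability device and does not change the normalized result.

```python
import math

def boltzmann(expected_costs, beta):
    """Return q(x) proportional to exp(-beta * expected_costs[x])."""
    low = min(expected_costs)  # shift for numerical stability
    weights = [math.exp(-beta * (c - low)) for c in expected_costs]
    total = sum(weights)
    return [w / total for w in weights]

costs = [1.0, 2.0, 3.0]          # hypothetical E(g_i | x_i) for three moves
q_hot = boltzmann(costs, 0.0)    # beta = 0: uniform, "perfectly irrational"
q_cold = boltzmann(costs, 50.0)  # large beta: mass concentrates on the best move
```

At beta = 0 every move is equally likely, while for large beta nearly all probability sits on the lowest-cost move, matching the limiting behavior described above.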

Eq. 3 is just a special case of Eq. 4, where all players share the
same private utility, $G$. (Such games are known as team games.) This
relationship reflects the fact that for this case, the difference
between the maxent Lagrangian and the one in Eq. 2 is independent of
$q_i$. Due to this relationship, our guarantee of the existence of a
solution to the set of maxent Lagrangians implies the existence of a
solution of the form Eq. 3. Typically players will be closer to
minimizing their expected cost than maximizing it. For prior
knowledge consistent with such a case, the $\beta_i$ are all
non-negative.

For each player $i$ define

$$f_i(x, q_i(x_i)) \equiv \beta_i g_i(x) + \ln[q_i(x_i)]. \qquad (5)$$

Then we can write the maxent Lagrangian for player $i$ as

$$\mathcal{L}_i(q) = \int dx\, q(x)\, f_i(x, q_i(x_i)). \qquad (6)$$

Now in a bounded rational game every player sets its strat- 
egy to minimize its Lagrangian, given the strategies of the 
other players. In light of Eq. 6, this means that we inter- 
pret each player in a bounded rational game as being per- 
fectly rational for a utility that incorporates its computa- 
tional cost. To do so we simply need to expand the domain 
of “cost functions” to include probability values as well as 
joint moves. 
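For readers who want the intermediate step, the following short derivation sketches why minimizing the Lagrangian of Eq. 6 over $q_i$ yields a Boltzmann distribution. It assumes a finite move space, holds the other players' distributions fixed, and introduces a multiplier $\mu$ (a symbol used only in this sketch) for the normalization of $q_i$.

```latex
\begin{align*}
\mathcal{L}_i(q)
  &= \sum_{x_i} q_i(x_i)\,\beta_i\, E_{q_{(i)}}(g_i \mid x_i)
     + \sum_{x_i} q_i(x_i) \ln q_i(x_i), \\
0 &= \frac{\partial}{\partial q_i(x_i)}
     \Big[ \mathcal{L}_i(q) + \mu \Big( \sum_{x'_i} q_i(x'_i) - 1 \Big) \Big]
   = \beta_i\, E_{q_{(i)}}(g_i \mid x_i) + \ln q_i(x_i) + 1 + \mu, \\
q_i(x_i) &\propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)} .
\end{align*}
```

The second sum in the first line is just $-S(q_i)$, so the stationarity condition balances expected cost against entropy exactly as described above.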

Often our prior knowledge will not consist of exact specification of
the expected costs of the players, even if that knowledge arises from
watching the players make their moves. Such alternative kinds of
prior knowledge are addressed in [17, 18]. Those references also
demonstrate the extension of the formulation to allow multiple
utility functions of the players, and even variable numbers of
players. Also discussed there are semi-coordinate transformations,
under which, intuitively, the moves of the agents are modified to set
in binding contracts.

3. Optimizing the Lagrangian

In this paper we consider two algorithms for optimizing the
Lagrangian. The first is Brouwer updating, which under different
names is perhaps the most common scheme employed in RL-based
algorithms for finding game equilibria. The second is a variant of
Newton’s method for directly descending the Lagrangian.

3.1. Brouwer updating 

One crude way to try to find the $q$ given by Eq. 4 would be an
iterative process akin to the best-response scheme of game theory
[8]. Given any current distribution $q$, in this scheme all agents
$i$ simultaneously replace their current distributions. In this
replacement each agent $i$ replaces $q_i$ with the distribution given
in Eq. 4 based on the current $q_{(i)}$, the distributions of all
agents other than $i$. This scheme is the basis of the use of
Brouwer’s fixed point theorem to prove that a solution to Eq. 4
exists.
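A single parallel Brouwer step can be sketched as follows. This is an illustrative toy, not the experimental code: it computes the conditional expected costs exactly by enumeration (feasible only for tiny games), and the two-agent "collision" cost `G`, the starting distributions, and all function names are assumptions made for the example.

```python
import math
from itertools import product

def boltzmann(costs, beta):
    """q(x) proportional to exp(-beta * cost(x)), normalized."""
    low = min(costs)
    weights = [math.exp(-beta * (c - low)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def conditional_expected_cost(G, q, i):
    """Exact E(G | x_i) under the other agents' product distribution."""
    sizes = [len(q_j) for q_j in q]
    cond = [0.0] * sizes[i]
    for x in product(*(range(n) for n in sizes)):
        weight = 1.0
        for j, x_j in enumerate(x):
            if j != i:
                weight *= q[j][x_j]
        cond[x[i]] += weight * G(x)
    return cond

def parallel_brouwer_step(G, q, beta):
    """All agents update simultaneously, each using the OLD joint q."""
    return [boltzmann(conditional_expected_cost(G, q, i), beta)
            for i in range(len(q))]

def G(x):
    # Hypothetical team-game cost: 1 if the two agents collide, else 0.
    return 1.0 if x[0] == x[1] else 0.0

q = [[0.6, 0.4], [0.4, 0.6]]
for _ in range(20):
    q = parallel_brouwer_step(G, q, beta=5.0)
```

From this slightly asymmetric start the agents settle on different moves. Starting from exactly uniform distributions, the same iteration would stay at the uniform fixed point of this toy game.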

This scheme requires estimating a conditional expected utility for
each agent at each iteration. These can be estimated via Monte-Carlo
sampling across a block of time in which $q$ is fixed. During that
block the agents all repeatedly and jointly IID sample their
probability distributions to generate joint moves, and the associated
utility values are recorded. This is exactly what is done in RL-based
schemes in which each agent maintains a data-based estimate of its
utility for each of its possible moves, and then chooses its actual
move stochastically, by sampling a Boltzmann distribution of those
estimates.
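The Monte-Carlo block described above can be sketched in a few lines. The function name, the toy cost `G`, and the choice to return `None` for moves that were never sampled are assumptions of this sketch, not details taken from the experiments.

```python
import random

def estimate_conditional_costs(G, q, n_samples, rng):
    """Estimate E(G | x_i) for every agent i and move x_i from IID joint samples."""
    totals = [[0.0] * len(q_i) for q_i in q]
    counts = [[0] * len(q_i) for q_i in q]
    for _ in range(n_samples):
        # Each agent samples its own distribution independently.
        x = tuple(rng.choices(range(len(q_i)), weights=q_i)[0] for q_i in q)
        g = G(x)
        for i, x_i in enumerate(x):
            totals[i][x_i] += g
            counts[i][x_i] += 1
    estimates = [
        [t / n if n > 0 else None for t, n in zip(ts, ns)]
        for ts, ns in zip(totals, counts)
    ]
    return estimates, counts

def G(x):
    # Hypothetical team-game cost: 1 if the two agents collide, else 0.
    return 1.0 if x[0] == x[1] else 0.0

q = [[0.6, 0.4], [0.4, 0.6]]
estimates, counts = estimate_conditional_costs(G, q, 20000, random.Random(0))
```

For this toy game the exact values are E(G | x_0 = 0) = 0.4 and E(G | x_0 = 1) = 0.6, so the estimates can be compared directly against them. With only a handful of samples per block the estimates become noisy, which is the motivation for low-variance private utilities.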

Since accurate estimates usually require extensive sampling, we
replace the $G$ occurring in each agent $i$’s update rule with a
private utility $g_i$ chosen to ensure that the Monte Carlo
estimation of $E(g_i \mid x_i)$ has both low bias (with respect to
estimating $E(G \mid x_i)$) and low variance [7]. Intuitively, this
bias reflects the alignment between the private and world utilities.
At zero bias, reducing private utility necessarily reduces world
utility. Variance instead reflects how much the utility depends on
the agent’s own move rather than on those of the other agents. With
low variance, the agents can perform the individual optimizations
accurately with minimal Monte-Carlo sampling.

In this paper we concentrate on two types of private utility in
addition to the team game (TG) utility. The first is the Aristocrat
Utility (AU). It is a correction to one of the same name previously
investigated in the RL literature (see [20, 19, 21] and references
therein). It is the utility, out of all those guaranteed to have zero
bias, that has minimal variance:

$$g_{AU_i}(x_i, x_{(i)}) \equiv G(x_i, x_{(i)}) - \sum_{x'_i} \frac{N_{x'_i}^{-1}}{\sum_{x''_i} N_{x''_i}^{-1}}\, G(x'_i, x_{(i)}) \qquad (7)$$

where $N_{x_i}$ is the number of times that agent $i$ makes move
$x_i$ in the most recent set of Monte Carlo samples. Due to the
subtracted term, AU should have lower variance than TG.

In addition we consider the Wonderful Life Utility (WLU), which is an
approximation to AU that also has zero bias:

$$g_{WLU_i}(x_i, x_{(i)}) \equiv G(x_i, x_{(i)}) - G(CL_i, x_{(i)}) \qquad (8)$$

where the clamping value $CL_i$ fixes agent $i$’s move to the action
to which it assigns lowest probability [16, 18]. (Again, this is a
correction to a utility of the same name previously investigated in
[20, 19, 21] and references therein.)
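The zero-bias property of WLU can be seen in a small sketch: because the subtracted term does not depend on agent i's actual move, differences in WLU between two of its moves equal the corresponding differences in G. The cost function, the joint move, and the helper names below are hypothetical.

```python
def lowest_probability_move(q_i):
    """Clamping value: the move to which the agent assigns lowest probability."""
    return min(range(len(q_i)), key=lambda move: q_i[move])

def wlu(G, x, i, clamp_move):
    """G(x_i, x_(i)) - G(CL_i, x_(i)) for joint move x and agent i."""
    clamped = list(x)
    clamped[i] = clamp_move
    return G(tuple(x)) - G(tuple(clamped))

def G(x):
    # Hypothetical world cost: number of pairs of agents choosing the same move.
    return float(sum(1 for a in range(len(x)) for b in range(a + 1, len(x))
                     if x[a] == x[b]))

q_0 = [0.5, 0.3, 0.2]               # hypothetical distribution of agent 0
clamp = lowest_probability_move(q_0)
value = wlu(G, (0, 1, 0), 0, clamp)  # agent 0 plays 0, others play (1, 0)
```

Only the term G(x_i, x_(i)) changes when agent 0 changes its move, so ranking moves by WLU is the same as ranking them by the world utility with the other agents held fixed.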

However the utilities are estimated, one obvious problem with Brouwer
updating is that there is no a priori reason to believe that it will
converge. Implicitly acknowledging this, in practice the Monte Carlo
samples are “aged”, to weight older sample points less heavily than
more recent points. See [20, 19, 21] for details. This modification
still provides no formal guarantees however. Such guarantees do
obtain though if rather than conventional “parallel” Brouwer
updating, one uses “serial Brouwer updating”, in which only one agent
at a time updates its distribution. Other alternatives are mixed
serial-parallel Brouwer updating. See [16, 18] for a discussion of
such techniques. Regardless of what type of Brouwer updating one uses
however, its intrinsic nature is to make no use whatsoever of the
many powerful techniques known for descending across functions like
$\mathcal{L}(q)$ to find its minimum. This is not the case with the
variant of Newton’s method discussed below.

3.2. Constrained Newton’s descent 

Typically in the RL-based work employing Brouwer updating,
constraints are introduced by ad hoc use of penalty functions.
However an alternative is provided by the straightforward extension
of the PD framework to constrained optimization. Given that the
agents in a MAS are bounded rational, if we have them play a
constrained team game with world utility $G$, their equilibrium will
be the optimizer of $G$ subject to those (potentially inexact)
constraints [16, 18]. Formally, let $\{c_j(x)\}$ be the constraint
functions, i.e., we seek a joint-move $x$ such that all of the
$\{c_j(x)\}$ are nowhere negative. Then the bounded rational
equilibrium will minimize the Lagrangian of Eq. 2 where the world
utility is augmented with Lagrange multipliers, $\lambda_j$, for each
of the constraints:

$$G(x) \to G(x) + \sum_j \lambda_j c_j(x). \qquad (9)$$

Consider a fixed set of values for the Lagrange parameters. We can
minimize the associated Lagrangian using gradient descent, since the
gradient can be evaluated in closed form. We can also evaluate the
Hessian in closed form. This allows us to use constrained Newton’s
method. This is a variant of Newton’s method in which we first modify
the Lagrangian, and then enforce both independence of the agents, and
that the search stays on the simplex of valid probabilities [18, 2]:

$$q_i(x_i) \to q_i(x_i) - \alpha\, q_i(x_i) \left[ \beta \left( E[G \mid x_i] - E[G] \right) + S(q_i) + \ln q_i(x_i) \right] \qquad (10)$$

where $\alpha$ plays the role of a step size.

The Lagrange multipliers are then updated in the usual way, by taking
the partial derivatives of the augmented Lagrangian:

$$\lambda_j \to \lambda_j + \delta\, E[c_j(x)] \qquad (11)$$

where $\delta$ is the step size.

Just as in Brouwer updating, to evaluate the update of 
Eq. 10 we need to estimate conditional expected utilities of 
each agent. Here we use the exact same Monte Carlo-based 
algorithms and private utilities used in Brouwer updating. 
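The two update rules can be sketched for a single agent as follows. This sketch assumes the probability update has the form q_i(x_i) -> q_i(x_i) - alpha q_i(x_i) [beta (E[G|x_i] - E[G]) + S(q_i) + ln q_i(x_i)], with beta the inverse temperature, and that the multipliers move along the expected constraint values. The function names and numbers are hypothetical, and the sketch does not guard against a large step leaving the simplex.

```python
import math

def newton_update(q_i, cond_costs, beta, alpha):
    """One constrained-Newton-style step for agent i (assumed form, see lead-in)."""
    expected = sum(p * c for p, c in zip(q_i, cond_costs))          # E[G]
    entropy = -sum(p * math.log(p) for p in q_i if p > 0.0)         # S(q_i)
    updated = []
    for p, c in zip(q_i, cond_costs):
        log_p = math.log(p) if p > 0.0 else 0.0  # factor p makes this term vanish
        updated.append(p - alpha * p * (beta * (c - expected) + entropy + log_p))
    return updated

def multiplier_update(lambdas, expected_constraints, delta):
    """lambda_j -> lambda_j + delta * E[c_j(x)]."""
    return [lam + delta * e for lam, e in zip(lambdas, expected_constraints)]

costs = [1.0, 2.0, 3.0]  # hypothetical E[G | x_i]
q_new = newton_update([1.0 / 3.0] * 3, costs, beta=1.0, alpha=0.5)
lambdas = multiplier_update([0.0, 1.0], [2.0, -1.0], delta=0.1)
```

Two properties are worth noting under these assumptions: the bracketed term averages to zero under q_i, so the step preserves normalization, and it vanishes identically when q_i is the Boltzmann distribution for the given costs, so that distribution is a fixed point.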

4. Experiments 
4.1. Queens Problem 

The N-queens problem is not hard to solve, especially with
centralized algorithms [14]. However it is a good illustration and
testbed of the PD-theory approach. The goal in this problem is to
locate N queens on an N-by-N chessboard such that there are no
conflicts between any of the queens, i.e., no shared rows, columns or
diagonals. For the results presented here, $N = 8$ and each agent’s
move is the position of a queen on an associated row of the
chessboard. Denoting agent $i$’s making move $j$ as $x_i(j)$, the
constraints are

$$x_i(j) \neq x_k(j)$$

$$x_i(j) \neq x_{i+k}(j+k) \neq x_{i-k}(j-k)$$

$$x_i(j) \neq x_{i+k}(j-k) \neq x_{i-k}(j+k)$$

For 8 queens this results in 84 constraints.
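One way to read the count of 84 is as three pairwise constraints (shared column, diagonal, anti-diagonal) for each of the 28 pairs of rows. The sketch below adopts that reading, which is an interpretation rather than something stated in the text, and checks a joint move against those constraints. The function names are hypothetical.

```python
from itertools import combinations

def violated_constraints(cols):
    """cols[i] is the column of the queen in row i; return violated (i, k, kind)."""
    violated = []
    for i, k in combinations(range(len(cols)), 2):
        d = k - i  # row separation, always positive here
        if cols[k] == cols[i]:
            violated.append((i, k, "column"))
        if cols[k] == cols[i] + d:
            violated.append((i, k, "diagonal"))
        if cols[k] == cols[i] - d:
            violated.append((i, k, "anti-diagonal"))
    return violated

def count_constraints(n):
    """Three pairwise constraints for each unordered pair of rows."""
    return 3 * n * (n - 1) // 2

solution = [0, 4, 7, 5, 2, 6, 1, 3]  # a known valid 8-queens placement
n_constraints = count_constraints(8)
```

A sampled joint move terminates the search exactly when `violated_constraints` returns an empty list for it.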

For this study the step size $\alpha$ was set to 1.0, while the data
aging rate $\gamma$ was set to 0.5. The optimizations were performed
at a range of fixed “temperatures” $T = \beta^{-1}$. 10 Monte-Carlo
samples were used for each probability and Lagrange multiplier
update. We concentrated on the number of iterations to convergence,
i.e., the number of probability updates times the number of
Monte-Carlo samples per update, for 50 random trials of the problem.
The optimization was terminated when a single Monte-Carlo sample
within an iteration was found which satisfied all of the constraints.

In other work we have used this problem to validate the predictions
of PD theory about the relative merits of our three utilities [13].
Here we concentrate on comparing constrained Newton descent with an
improved version of Brouwer, in which the constraints are implemented
with Lagrange multipliers updated according to Eq. 11 rather than
with penalty functions. So in these experiments, only the method for
updating the probability distributions differed from that of
constrained Newton updating. The step size $\alpha$ was set to 1.0,
while the data aging rate $\gamma$ was set to 0.5. The optimizations
were performed at a range of fixed temperatures and 10 Monte-Carlo
samples were used for each probability and Lagrange multiplier
update.

Figure 1. Comparison of RL-based and PD theory-based equilibration
methods for the Queens problem. The top figure presents the fraction
of trials which successfully solved the problem, the second figure
presents the mean number of iterations to that solution when it was
arrived at, and the bottom figure is the associated 95% confidence
value.

Changing the method for updating the probability distributions from
the constrained Newton approach to parallel Brouwer degraded the
completion rate, as indicated by Figure 1. However since we modified
Brouwer updating to incorporate the constraints using Lagrange
parameters rather than penalty functions, the improvement from
constrained Newton was not as pronounced as it might otherwise be.
Indeed, the same figure shows that over some temperatures,
constrained Newton does not outperform parallel Brouwer updating.


4.2. Bar Problem 

A modified version of Arthur’s El Farol Bar Problem has been used
before to investigate the RL-based approach [19]. Here we use that
same problem to compare constrained Newton updating and parallel
Brouwer updating on an unconstrained optimization problem.

In this scenario there are $N$ agents, each selecting one of seven
nights to attend a bar, i.e., each agent has 7 categorical moves. The
world utility function is given by

$$G(\zeta) = \sum_{k=1}^{7} \phi(x_k(\zeta)), \qquad (12)$$

where $x_k(\zeta)$ is the total attendance on night $k$, and
$\phi(y) = y \exp(-y/c)$ with $c$ a real-valued parameter. This
choice of $\phi(\cdot)$ means that either too few or too many agents
attending a single night results in low world utility $G$. The
parameter $c$ is set to control the size of this congestion effect.
For the results presented here, $N$ was set to 168 with $c$ set to 6.
Note that here the goal is to maximize $G$, which results in
appropriate sign flips in the Lagrangian search.
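The Bar problem's world utility is simple to evaluate directly, which makes the congestion effect concrete. In this sketch the function name and the two attendance patterns are hypothetical; only the form of the utility and the values N = 168, c = 6, and seven nights come from the text.

```python
import math

def world_utility(moves, c=6.0, n_nights=7):
    """G = sum over nights k of phi(x_k), with phi(y) = y * exp(-y / c)."""
    attendance = [0] * n_nights
    for night in moves:
        attendance[night] += 1
    return sum(y * math.exp(-y / c) for y in attendance)

# Hypothetical pattern 1: 168 agents spread evenly, 24 per night.
uniform = [agent % 7 for agent in range(168)]

# Hypothetical pattern 2: 6 agents on each of six nights, the other 132 on the last.
skewed = [night for night in range(6) for _ in range(6)] + [6] * 132

g_uniform = world_utility(uniform)
g_skewed = world_utility(skewed)
```

Since phi peaks at y = c, spreading the agents evenly (24 per night) is far from optimal here. Crowding most agents into one night so that the remaining nights each sit at the peak gives a much higher world utility.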

Previous work with this problem had illustrated the advantages of
using modified private utilities rather than the team game with
parallel Brouwer updating [20, 19]. The goal of the current work is
to test the advantages of constrained Newton over parallel Brouwer
updating. Results were generated over a range of fixed temperatures
to highlight the difference. Figure 2 shows the comparison between
constrained Newton and parallel Brouwer over the temperature range.
Shown is the performance (value of $G$) after 1000 iterations
averaged over 20 cases. The 95% confidence intervals in all cases are
less than 0.5. Constrained Newton is seen to outperform parallel
Brouwer at all temperatures.

More detail is provided in Figure 3. This figure compares the two
updating schemes as the number of iterations increases, for a fixed
temperature $T$ of 0.01. The constrained Newton scheme initially
improves much faster, with parallel Brouwer catching up several
hundred iterations later. However constrained Newton then continues
to improve. In contrast, parallel Brouwer updating oscillates,
without significant improvement. Also note the tighter confidence
bars on the constrained Newton results, reflecting higher robustness.
The step size $\alpha$ for the results shown was fixed at 0.01 while
the data aging rate $\gamma$ was held at 0.1. Wonderful Life Utility
(WLU) was used with a Monte-Carlo block size of 1. Other values for
these parameters and other utilities were also considered and
resulted in similar trends.

Figure 2. Comparison of RL-based and PD theory-based equilibration
methods for the Bar problem. Performance is measured at the end of
the run. Higher is better.

Figure 3. Comparison of RL-based and PD theory-based equilibration
methods for the Bar problem. Performance is measured against time
(iteration), with temperature $T$ fixed to 0.01 for both schemes.

4.3. Bin Packing Problem

We also compared the two updating methods on an intermediate problem,
in which rather than just try to solve a set of constraints (as in
the Queens problem) or try to optimize an unconstrained problem (as
in the Bar problem), we try to optimize a discrete constrained
problem [4]. We chose the bin packing problem for this comparison.
This problem consists of assigning $N$ items of differing sizes into
the smallest number of bins each with capacity $c$. For the current
study instances were chosen which have a designed minimum number of
bins [Falk94] and were obtained from the OR-Library at
http://mscmga.ms.ic.ac.uk/info.html [Beas90]. The instances consisted
of 60 items to be packed in groups of three into 20 bins each of
capacity 100. Since in general the minimum number of bins is not
known, the move space of the agents was set to the number of items.
So the world utility is

$$G = \begin{cases} \sum_{i=1}^{N} \text{[illegible]} & \text{if } x_i \le c \\ \sum_{i=1}^{N} |x_i - c| & \text{if } x_i > c \end{cases} \qquad (13)$$

where $x_i$ is the total size of the items in bin $i$.

To mimic the use of penalty functions common with RL-based schemes,
we also ran experiments in which the updating algorithms did not use
this world utility directly (although we used it to measure
performance). Instead we added an additional penalty term to $G$ to
smoothly enforce the constraint on the number of bins:

$$G_{added} = 1000\, (N_{filled} - N_{optimum}). \qquad (14)$$

This provides a more meaningful comparison between constrained Newton
and typical multi-agent techniques using parallel Brouwer updating.

Two different kinds of constrained Newton were explored in our
experiments. The first used the modified $G$ in conjunction with the
constrained Newton approach for updating the probabilities. The
second did not add the penalty function but instead used Lagrange
parameters to enforce the hard constraints on bin levels that
$x_i \le c$ for all $i$.

For each problem variant 20 cases were used in determining the
averages and error bars. Figure 4 compares all three schemes. The
first plot shows the average number of bins over the optimum found by
each approach. Constrained Newton with a penalty function performed
best here. The second plot shows the number of bins over capacity
while the third plot shows the amount of total overcapacity. Note
that the unique optimum is no bins over the optimum and no bins over
capacity. The results indicate that the parallel Brouwer and the
constrained Newton result in similar numbers of over capacity bins,
but the constrained Newton has much lower total over capacity. The
constrained Newton with constraints shows even better performance,
resulting in very few over capacity bins and on average only 5 bins
over the optimum.
5. Conclusion 

Product Distribution (PD) theory is a broad framework 
recently developed for analyzing and optimizing distributed 
systems. It can be derived as the information-theoretic ex- 
tension of conventional full-rationality game theory to the 
case of bounded rational agents. Here we demonstrate its 
use for adaptive distributed control of MAS’s, i.e., for distributed
stochastic optimization using MAS’s.

In PD theory the bounded rational equilibrium of the game is the
optimizer of a Lagrangian of the (probability distribution of) the
joint state of the agents. When the game in question is a team game
with constraints, that equilibrium optimizes the expected value of
the team game utility, subject to those constraints. One common way
to find that equilibrium is to have each agent run a Reinforcement
Learning (RL) algorithm. PD theory reveals this to be a particular
type of search algorithm for minimizing the Lagrangian. Typically
that algorithm is quite inefficient. A more principled alternative is
to use a variant of Newton’s method to minimize the Lagrangian. Here
we compare this alternative to RL-based search in three sets of
computer experiments. These are the N-Queens problem and bin-packing
problem from the optimization literature, and the Bar problem from
the distributed RL literature. Our results confirm that the
PD-theory-based approach outperforms the RL-based scheme in all three
domains.

Figure 4. Comparison of RL-based and PD theory-based equilibration
methods for the bin packing problem.